Fixes the issue described in #338 where the `__dp4a`-based CUDA kernels cause the model to produce garbage. The problem is that the cmake CUDA compute capabilities set in llama.cpp and ggml are different. In llama.cpp they are set to `52;61`: the lowest allowed compute capability, and the minimum compute capability for `__dp4a`. A GPU will automatically use the PTX code compiled for the highest compute capability it supports. If only 5.2 PTX code is generated (the default), the fallback implementation that returns 0 is used even on GPUs with compute capability >= 6.1, which causes the model to produce garbage outputs. Ideally I would have put the `__CUDA_ARCH__` check outside the kernels, but unfortunately this is not possible; `__CUDA_ARCH__` is only available in device code.
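
For illustration, here is a minimal sketch of the guard pattern described above. It is not the actual llama.cpp kernel; the kernel name, test values, and fallback body are made up, but it shows why a `__dp4a` call must be guarded by `__CUDA_ARCH__` inside device code and what happens when only the low-architecture branch gets compiled:

```cuda
#include <cuda_runtime.h>
#include <cstdint>
#include <cstdio>

// Hypothetical kernel: dot product of n ints, each packing four int8 values.
__global__ void dot_int8(const int *a, const int *b, int *out, int n) {
    int sum = 0;
    for (int i = threadIdx.x; i < n; i += blockDim.x) {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 610
        // __dp4a: 4-way int8 dot product plus accumulate, CC >= 6.1 only.
        sum = __dp4a(a[i], b[i], sum);
#else
        // Fallback compiled into PTX for architectures below 6.1. If this
        // branch simply returned 0 and only 5.2 PTX were shipped, a CC 6.1+
        // GPU running that PTX would silently produce garbage -- the exact
        // failure mode this PR fixes.
        const int8_t *pa = (const int8_t *)&a[i];
        const int8_t *pb = (const int8_t *)&b[i];
        for (int k = 0; k < 4; ++k) {
            sum += (int)pa[k] * (int)pb[k];
        }
#endif
    }
    atomicAdd(out, sum);
}

int main() {
    const int n = 256;
    int *a, *b, *out;
    cudaMallocManaged(&a, n * sizeof(int));
    cudaMallocManaged(&b, n * sizeof(int));
    cudaMallocManaged(&out, sizeof(int));
    // Each int packs four int8 values of 1, so the expected result is 4 * n.
    for (int i = 0; i < n; ++i) { a[i] = 0x01010101; b[i] = 0x01010101; }
    *out = 0;
    dot_int8<<<1, 128>>>(a, b, out, n);
    cudaDeviceSynchronize();
    printf("dot = %d (expected %d)\n", *out, 4 * n);
    cudaFree(a); cudaFree(b); cudaFree(out);
    return 0;
}
```

Both branches have to exist in the compiled PTX for the right one to be picked at runtime, which is why the build must request both architectures, e.g. with cmake's `CMAKE_CUDA_ARCHITECTURES` set to `52;61` as described above.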